Word Length n-Grams for Text Re-use Detection
نویسندگان
چکیده
The automatic detection of shared content in written documents –which includes text reuse and its unacknowledged commitment, plagiarism– has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which proved to be quite effective in many applications As this approach becomes normally impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, providing three important advantages: (i) being the alphabet of the documents reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We experimentally show, on the basis of the perplexity measure, that the noise introduced by the length encoding does not decrease importantly the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
منابع مشابه
A Study Using n-gram Features for Text Categorization
In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm R IPPER indicate that, after the removal of stop words, word sequences of length 2 or...
متن کاملCategory-based Statistical Language Models Synopsis
Language models are computational techniques and structures that describe word sequences produced by human subjects, and the work presented here considers primarily their application to automatic speech-recognition systems. Due to the very complex nature of natural languages as well as the need for robust recognition, statistically-based language models, which assign probabilities to word seque...
متن کاملExploring Lexical and Syntactic Features for Language Variety Identification
We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and partof-speech n-grams. The effecti...
متن کاملA Statistical Approach to Automatic OCR Error Correction in Context
This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexico...
متن کاملA New Method of Region Embedding for Text Classification
To represent a text as a bag of properly identified “phrases” and use the representation for processing the text is proved to be useful. The key question here is how to identify the phrases and represent them. The traditional method of utilizing n-grams can be regarded as an approximation of the approach. Such a method can suffer from data sparsity, however, particularly when the length of n-gr...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010